This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.
You can find the complete handbook on Github
This page demonstrates common steps necessary to clean a dataset. It uses a simulated Ebola case linelist, which is used throughout the handbook.
HOW TO READ: To emphasize the tidyverse coding approach, each cleaning step is explained individually and then incorporated into a “cleaning pipeline” - a series of cleaning actions linked together sequentially through pipes (LINK TO PIPES). The pipe begins with the “raw” data (linelist_raw) and ends with a “clean” dataset (linelist).
The cleaning steps demonstrated include:
case_when())replace missing with dealing with cases (all lower, etc) case_when() factors
Import the raw dataset using the import() function from the package rio. (LINK HERE TO IMPORT PAGE)
You can view the first 50 rows of the the original “raw” dataset below:
In epidemiological analysis and data processing, cleaning steps are often performed together and sequentially. In R this often manifests as a cleaning “pipeline”, where the raw dataset is passed or “piped” from one cleaning step to another. The chain often utilizes dplyr verbs and the magrittr pipe operator (see handbook page on dplyr and tidyverse coding style (LINK HERE).
In a cleaning pipeline the order of the steps is important. Cleaning steps may include:
Column names are used very often so they need to have “clean” syntax. We suggest the following:
The columns names of linelist_raw are below. We can see that there are some with spaces. We also have different naming patterns for dates (‘date onset’ and ‘infection date’).
Also note that in the raw data, the two final columns names were two merged cells with one name. The import() function used the name for the first of the two columns, and assigned the second column the name “…23” as it was then empty (referring to the 23rd column).
names(linelist_raw)
## [1] "row_num" "case_id" "generation" "infection date"
## [5] "date onset" "hosp date" "date_of_outcome" "outcome"
## [9] "gender" "hospital" "lon" "lat"
## [13] "infector" "source" "age" "age_unit"
## [17] "fever" "chills" "cough" "aches"
## [21] "vomit" "merged_column" "...23"Note: For a column name that include spaces, surround the name with back-ticks, for example: linelist$`infection date`. On a keyboard, the back-tick (`) is different from the single quotation mark ('), and is sometimes on the same key as the tilde (~).
The function clean_names() from the package janitor standardizes column names and makes them unique by doing the following:
case = argument (“snake” is default, alternatives include “sentence”, “title”, “small_camel”…)replace = argument (e.g. replace = c(onset = “date_of_onset”))Below, the cleaning pipeline begins by using clean_names() on the raw linelist.
# send the dataset through the function clean_names()
linelist <- linelist_raw %>%
janitor::clean_names()
# see the new names
names(linelist)
## [1] "row_num" "case_id" "generation" "infection_date"
## [5] "date_onset" "hosp_date" "date_of_outcome" "outcome"
## [9] "gender" "hospital" "lon" "lat"
## [13] "infector" "source" "age" "age_unit"
## [17] "fever" "chills" "cough" "aches"
## [21] "vomit" "merged_column" "x23"NOTE: The column name “…23” was changed to “x23”.
Re-naming columns manually is often necessary. Below, re-naming is performed using the rename() function from the dplyr package, as part of a pipe chain. rename() uses the style “NEW = OLD”, the new column name is given before the old column name.
Below, a re-name command is added to the cleaning pipeline:
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
# standardize column name syntax
janitor::clean_names() %>%
# manually re-name columns
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome)Now you can see that the columns names have been changed:
# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )You can also rename by column position, instead of column name, for example:
If you importing an Excel sheet with a missing column name, depending on the import function used, R will likely create a column name with a value like “…1” or “…2”. You can clean these names manually by referencing their position number (see above), or their name (linelist_raw$...1).
Merged cells in an Excel file are a common occurrance when receiving data from field level. Merged cells can be nice for human reading of data, but cause many problems for machine reading of data. R cannot accomodate merged cells. If at all possible, try to change procedures so that data arrive in a tidy format without merged cells.
When using rio’s import() function, the value in a merged cell will be assigned to the first cell and subsequent cells will be empty.
One solution to deal with merged cells is to import the data with the function readWorkbook() from package openxlsx. Set the argument fillMergedCells = TRUE. This gives the value in a merged cell to all cells within the merge range.
linelist_raw <- openxlsx::readWorkbook(here("data", "ebola_simulated.xlsx"), fillMergedCells = TRUE)DANGER: If column names are merged, you will end up with duplicate column names, which you will need to fix manually - R does not work well with duplicate column names! You can re-name them by referencing their position (e.g. column 5), as explained in the section on manual column name cleaning..
Sometimes, you may want to avoid importing a row of data. Using import() from the rio package on a .xlsx file, you can do this with the argument skip =. Provide the number of rows you want to skip.
You may need to avoid importing the second row of data, for example if it is a data dictionary row (as in the example linelist). This can be problematic because it can result in all columns being imported as class “character”. To solve this, you will likely need to import the data twice.
The exact arguments used to bind the correct column names depends on the type of data file (.csv, .tsv, .xlsx, etc.). If using rio’s import() function, understand which function rio uses to import your data, and then give the appropriate argument to skip lines and/or designate the column names. See the handbook page on importing data (LINK) for details on rio.
For Excel files:
# REMOVE 2nd ROW (DATA DICTIONARY)
linelist_raw_names <- import(here::here("data", "ebola_simulated.xlsx")) %>% names()
linelist_raw <- import(here::here("data", "ebola_simulated.xlsx"), skip = 2, col_names = linelist_raw_names)# For excel files
linelist_raw_names <- import("ebola_simulated.xlsx") %>% names()
linelist_raw <- import("ebola_simulated.xlsx", skip = 2, col_names = linelist_raw_names) # argument is 'col_names'For CSV files:
# For csv files
linelist_raw_names <- import("ebola_simulated.csv") %>% names()
linelist_raw <- import("ebola_simulated.csv", skip = 2, col.names = linelist_raw_names) # argument is 'col.names'Backup option - changing column names as a separate command
# assign/overwrite headers using the base 'colnames()' function
colnames(linelist_raw) <- linelist_raw_namesBonus! If you do have a second row that is a data dictionary, you can easily create a proper data dictionary from it using the gather() command from the tidyr package.
source: https://alison.rbind.io/post/2018-02-23-read-multiple-header-rows/
TO DO
CAUTION: This tab may follow from previous tabs.
Often the first step of cleaning data is selecting the columns you want to work with, and to set their order in the dataframe. In a dplyr chain of verbs, this is done with select(). Note that in these examples we modify linelist with select(), but do not assign/overwrite. We just display the resulting new column names, for purpose of example.
CAUTION: In the examples below, linelist is modified with select() but not over-written. New column names are only displayed for purpose of example.
Here are all the column names in the linelist:
names(linelist)
## [1] "row_num" "case_id" "generation"
## [4] "date_infection" "date_onset" "date_hospitalisation"
## [7] "date_outcome" "outcome" "gender"
## [10] "hospital" "lon" "lat"
## [13] "infector" "source" "age"
## [16] "age_unit" "fever" "chills"
## [19] "cough" "aches" "vomit"
## [22] "merged_column" "x23"select() you can do the following:Select only the columns you want to remain, and their order of appearance
# linelist dataset is piped through select() command, and names() prints just the column names
linelist %>%
select(case_id, date_onset, date_hospitalisation, fever) %>%
names() # display the column names
## [1] "case_id" "date_onset" "date_hospitalisation"
## [4] "fever"Indicate which columns to remove by placing a minus symbol “-” in front of the column name (e.g. select(-outcome)), or a vector of column names (as below). All other columns will be retained. Inside select() you can use normal operators such as c() to list several columns, : for consecutive columns, ! for opposite, & for AND, and | for OR.
linelist %>%
select(-c(date_onset, fever:vomit)) %>% # remove onset and all symptom columns
names()
## [1] "row_num" "case_id" "generation"
## [4] "date_infection" "date_hospitalisation" "date_outcome"
## [7] "outcome" "gender" "hospital"
## [10] "lon" "lat" "infector"
## [13] "source" "age" "age_unit"
## [16] "merged_column" "x23"Re-order the columns - use everything() to signify all other columns not specified in the select() command:
# move case_id, date_onset, date_hospitalisation, and gender to beginning
linelist %>%
select(case_id, date_onset, date_hospitalisation, gender, everything()) %>%
names()
## [1] "case_id" "date_onset" "date_hospitalisation"
## [4] "gender" "row_num" "generation"
## [7] "date_infection" "date_outcome" "outcome"
## [10] "hospital" "lon" "lat"
## [13] "infector" "source" "age"
## [16] "age_unit" "fever" "chills"
## [19] "cough" "aches" "vomit"
## [22] "merged_column" "x23"As well as everything() there are several special functions that work within select(), namely:
everything() - all other columns not mentionedlast_col() - the last columnwhere() - applies a function to all columns and selects those which are TRUEstarts_with() - matches to a specified prefix. Example: select(starts_with("date"))ends_with() - matches to a specified suffix. Example: select(ends_with("_end"))contains() - columns containing a character string. Example: select(contains("time"))matches() - to apply a regular expression (regex). Example: select(contains("[pt]al"))num_range() -any_of() - matches if column is named. Useful if the name might not exist. Example: select(any_of(date_onset, date_death, cardiac_arrest))Here is an example using where():
select()to the cleaning pipe chain:In the linelist, there are a few columns we do not need: row_num, merged_column, and x23. Remove them by adding a select() command to the cleaning pipe chain:
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
# remove 2nd row, which contains data dictionary values
#######################################################
# store column names
linelist_raw_names <- import(here::here("data", "ebola_simulated.xlsx")) %>% names()
# import raw dataset, skipping 2nd row (data dictionary row) and re-attaching column names
linelist_raw <- import(here::here("data", "ebola_simulated.xlsx"), skip = 2, col_names = linelist_raw_names)
# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
# standardize column name syntax
janitor::clean_names() %>%
# manually re-name columns
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome) %>%
# remove column
select(-c(row_num, merged_column, x23))select() as a stand-alone commandSelect can also be used as an independent command (not in a pipe chain). In this case, the first argument is the original dataframe to be operated upon.
See the tabs below to add columns and rows
mutate()We advise creating new columns with dplyr functions as part of a chain of such verb functions (e.g. filter, mutate, etc.)
If in need of a stand-alone command, you can use mutate() or the base R style to create a new column (see below).
The verb mutate() is used to add a new column, or to modify an existing one. Below are some example of creating new columns with mutate(). The syntax is: new_column_name = value or function. It is best practice to separate each new column with a comma and new line.
linelist <- linelist %>% # creating new, or modifying old dataset
mutate(new_var_dup = case_id, # new column = duplicate/copy another column
new_var_static = 7, # new column = all values the same
new_var_static = new_var_static + 5, # you can overwrite a column, and it can be a calculation using other variables
new_var_paste = stringr::str_glue("{hospital} on ({date_hospitalisation})") # new column = pasting together values from other columns
) Scroll to the right to see the new columns:
# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )TIP: The verb transmute() adds new columns just like mutate() but also drops/removes all other columns that you do not mention.
To define a new column (or re-define a column) using base R, just use the assignment operator as below. Remember that when using base R you must specify the dataframe before writing the column name (e.g. dataframe$column). Here are two dummy examples:
TO DO
Remember that each column must contain values of only one class (either character, numeric, logical, etc.). So adding a row requires nuance to maintain this.
linelist <- linelist %>%
add_row(row_num = 666, case_id = "abc", generation = 4, `infection date` = as.Date("2020-10-10"), .before = 2)use .before and .after. .before = 3 will put it before the 3rd row. Default is to add it to the end. columns not specified will be let empty. The new row number may look strange (“…23”) but the row numbers have changed. So if using the command twice examine/test carefully.
If your class is off you will see an error like this: Error: Can’t combine ..1$infection date ..2$infection date as.Date() like as.Date("2020-10-10"))
CAUTION: This tab may follow from previous tabs.
Using mutate on GROUPED dataframes https://dplyr.tidyverse.org/reference/mutate.html
Taken from website above:
#Because mutating expressions are computed within groups, they may yield different results on grouped tibbles. This will be the case as #soon as an aggregating, lagging, or ranking function is involved. Compare this ungrouped mutate:
starwars %>%
select(name, mass, species) %>%
mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
With the grouped equivalent:
starwars %>%
select(name, mass, species) %>%
group_by(species) %>%
mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
The former normalises mass by the global average whereas the latter normalises by the averages within species levels.CAUTION: This tab may follow from previous tabs.
See section on object classes
Often you will need to set the correct class for a column. The most common approach is to use mutate() to define the column as itself, but with a different class.
First we run some checks on the classes of important columns.
The class of the “age” column is character. To perform analysis, we need those numbers to be recognized as numeric!
The class of the “date_onset” column is also character! To perform analysis, these dates must be recognized as dates!
However, if we try to classify this column as date, we would get an error. Use table() or sort or another method to examine all the values and identify different one. For example in our dataset we see that we see that one date_onset value was entered in a different format (15th April 2014) than all the other values!
##
## 15th April 2014 2012-05-01 2012-06-17 2012-06-21 2012-06-24
## 1 1 1 1 2
## 2012-06-27
## 1
Before we can classify “date_onset” as a date, this value must be fixed to be the same format as the others. You can fix the date in the source data, or, we can do in the cleaning pipeline via mutate() and recode(). This must be done before the commands to convert to class Date. (LINK TO DATE SECTION).
The mutate() line can be read as: “mutate date_onset to equal date_onset recoded so that OLD VALUE is changed to NEW VALUE”. Note that this pattern (OLD = NEW) for recode() is the opposite of most R patterns (new = old). The R development community is working on revising this for recoding.
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
# remove 2nd row, which contains data dictionary values
#######################################################
# store column names
linelist_raw_names <- import(here::here("data", "ebola_simulated.xlsx")) %>% names()
# import raw dataset, skipping 2nd row (data dictionary row) and re-attaching column names
linelist_raw <- import(here::here("data", "ebola_simulated.xlsx"), skip = 2, col_names = linelist_raw_names)
# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
# standardize column name syntax
janitor::clean_names() %>%
# manually re-name columns
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome) %>%
# remove column
select(-c(row_num, merged_column, x23)) %>%
# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
###################################################
# fix incorrect values # old value # new value
mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>%
# correct the class of the columns
mutate(across(contains("date"), as.Date),
generation = as.numeric(generation),
age = as.numeric(age))
Especially after converting to class date, check your data visually or with table() to confirm that they were converted correctly! For as.Date(), the format = argument is often a source of errors.
class(linelist$date_infection)
## [1] "Date"
head(linelist$date_infection)
## [1] "2014-04-09" NA NA "2014-05-07" NA
## [6] "2014-05-06"You can use The dplyr function across() with mutate() to convert several columns at once to a new class. across() allows you to specify which columns you want a function to apply to. Below, we want to mutate the columns where is.POSIXct() (a type of date/time class that shows unnecessary timestamps) is TRUE, and apply the function is.Date() to them, in order to convert them to class “date”.
across() we also use the function where().is.character(), is.numeric(), and is.logical()) are from base Racross() are written without the empty parentheses ()blah blah blah TO DO
mutate()mutate() is also used to recode the values in a column. For example, in linelist the values in the column “hospital” must be cleaned. There are several incorrect spellings, and many missing values.
table(linelist$hospital, useNA = "always")
##
## Connaught Hopital
## 48
## Connaught Hospital
## 1757
## Hospital A
## 54
## Hospital B
## 54
## Military Hopital
## 31
## Military Hospital
## 802
## Mitylira Hopital
## 1
## Mitylira Hospital
## 80
## other
## 905
## Princess Christian Maternity Hopital (PCMH)
## 11
## Princess Christian Maternity Hospital (PCMH)
## 421
## Rokupa Hopital
## 11
## Rokupa Hospital
## 452
## <NA>
## 1500recode()To change spellings manually, one-by-one, you can use the recode() function within the mutate function. The code is saying that the column “hospital” should be defined as the current column “hospital”, but with certain changes (the syntax is OLD = NEW). Don’t forget commas!
linelist <- linelist %>%
mutate(hospital = recode(hospital,
# OLD = NEW
"Mitylira Hopital" = "Military Hospital",
"Mitylira Hospital" = "Military Hospital",
"Military Hopital" = "Military Hospital",
"Connaught Hopital" = "Connaught Hospital",
"Rokupa Hopital" = "Rokupa Hospital",
"other" = "Other",
"Princess Christian Maternity Hopital (PCMH)" = "Princess Christian Maternity Hospital (PCMH)"
))Now we see the values in the hospital column have been corrected:
table(linelist$hospital, useNA = "always")
##
## Connaught Hospital
## 1805
## Hospital A
## 54
## Hospital B
## 54
## Military Hospital
## 914
## Other
## 905
## Princess Christian Maternity Hospital (PCMH)
## 432
## Rokupa Hospital
## 463
## <NA>
## 1500TIP: The number of spaces before and after an equals sign does not matter. Make your code easier to read by aligning the = for all or most rows. Also, consider adding a hashed comment row to clarify for future readers which side is OLD and which side is NEW.
TIP: Sometimes a blank character value exists in a dataset (not recognized as R’s value for missing - NA). You can reference this value with two quotation marks with no space inbetween ("").
If you need to write a stand-alone command using base R (e.g. not part of a chain of dplyr verbs), then you can create a new column by assigning it a value. In the command below, the column new_var does not exist until after the command is executed. In this simple example the column is assigned the static value “new value”, so for all rows the value will be “new value”.
If necessary, you make manual changes to a specific value in a dataframe by referencing the row number of case ID. But remember it is better if you can make these changes permanently in the underlying data!
Here is a fake example. It reads as “Change the value of the dataframe linelist‘s column onset_date (for the row where linelist’s column case_id has the value ’9d4019’) to as.Date("2020-10-24")”.
case_when()If you need to use logic statements to recode values, or want to use operators like %in%, use dplyr’s case_when() instead. If you use case_when() please read the thorough explanation HERE LINK, as there are important differences from recode() in syntax and logic order!
Note that all Right-hand side (RHS) inputs must be of the same class (e.g. character, numeric, logical). Notice the use of the special value NA_real_ instead of just NA.
ifelse() and if_else()For simple uses of logical re-coding or new variable creationgyou can use ifelse() or if_else(). Though in most cases it is better to use case_when().
These commands are simplified versions of an if and else statement. The general syntax is ifelse(condition, value if condition evaluates to TRUE, value if condition evaluates to FALSE). If used in a mutate(), each row is evaluated. if_else() is a special version from dplyr that handles dates in the condition.
It can be tempting to string together many ifelse commands… resist this and use case_when() instead! It is much more simple, easier to read, and easier to identify errors.
IMAGE of ifelse string with X across is.
You can reference other columns with the ifelse() function within mutate():
Example of ifelse():
Example of if_else() (using dates): Note that if the ‘true’ value is a date, the ‘false’ value must also qualify a date, hence using the special character NA_real_ instead of just NA.
Note: If you want to alternate a value used in your code based on other circumstances, consider using switch() from base R. For example if… TO DO. See the section on using switch() in the page on R interactive console.
replace_na()To change missing values (NA) to a specific character value, such as “Missing”, use the function replace_na() within mutate(). Note that this is used in the same manner as recode above - the name of the variable must be repeated within replace_na().
na_if()Likewise you can quickly convert a specific character value to NA using na_if(). The command below is the opposite of the one above. It converts any values of “Missing” to NA.
coalesce()This dplyr function finds the first non-missing value at each position. So, you provide it with columns and for each row it will fill the value with the first non-missing value in the columns you provided.
For example, you might use thiscoalesce()` create a “location” variable from hypothetical variables “patient_residence” and “reporting_jurisdiction”, where you prioritize patient residence information, if it exists.
TO DO lead(), lag() cumsum(), cummean(), cummin(), cummax(), cumany(), cumall(),
CAUTION: This tab may follow from previous tabs.
## load cleaning rules and only keep columns in mll
mll_cleaning_rules <- import(here("dictionaries/mll_cleaning_rules.xlsx")) %>%
filter(column %in% c(names(mll_raw), ".global"))
## define columns that are not cleand
unchanged <- c(
"epilink_relationship",
"narratives",
"epilink_relationship_detail"
)
mll_clean <- mll_raw %>%
## convert to tibble
as_tibble() %>%
## clean columns using cleaning rules
clean_data(
wordlists = mll_cleaning_rules,
protect = names(.) %in% unchanged
)Here we add the described cleaning steps to the pipe chain.
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
# remove 2nd row, which contains data dictionary values
#######################################################
# store column names
linelist_raw_names <- import(here::here("data", "ebola_simulated.xlsx")) %>% names()
# import raw dataset, skipping 2nd row (data dictionary row) and re-attaching column names
linelist_raw <- import(here::here("data", "ebola_simulated.xlsx"), skip = 2, col_names = linelist_raw_names)
# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
# standardize column name syntax
janitor::clean_names() %>%
# manually re-name columns
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome) %>%
# remove column
select(-c(row_num, merged_column, x23)) %>%
# fix incorrect values # old value # new value
mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>%
# correct the class of the columns
mutate(across(contains("date"), as.Date),
generation = as.numeric(generation),
age = as.numeric(age)) %>%
# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
###################################################
# clean values of hospital column
mutate(hospital = recode(hospital,
# OLD = NEW
"Mitylira Hopital" = "Military Hospital",
"Mitylira Hospital" = "Military Hospital",
"Military Hopital" = "Military Hospital",
"Connaught Hopital" = "Connaught Hospital",
"Rokupa Hopital" = "Rokupa Hospital",
"other" = "Other",
"Princess Christian Maternity Hopital (PCMH)" = "Princess Christian Maternity Hospital (PCMH)"
)) %>%
mutate(hospital = replace_na(hospital, "Missing")) %>%
# create age_years column (from age and age_unit)
mutate(age_years = case_when(
age_unit == "years" ~ age,
age_unit == "months" ~ age/12,
is.na(age_unit) ~ age,
TRUE ~ NA_real_))CAUTION: This tab may follow from previous tabs.
A typical early cleaning step is to filter the dataframe for specific rows using the dplyr verb filter(). Within filter(), give the logic that must be TRUE for a row in the dataset to be kept.
The tabs below show how to filter rows based on simple and complex logical conditions, and how to filter/subset rows as a stand-alone command and with base R
filter()A simple example applies a filter command within a pipe chain. The command re-defines the dataframe linelist as itself having filtered the rows to meet a logical condition. Only the rows where the logical statement within the parentheses is TRUE are kept.
In this case, the logical statement is !is.na(case_id), which is asking whether the value in the column case_id is not missing (NA). Thus, rows where case_id is not missing are kept.
Before the filter is applied, the number of rows in linelist is 6127.
After the filter is applied, the number of rows in linelist is 6124.
filter()A more complex example using filter():
Below is a simple one-line command to create a histogram of onset dates. See that a second smaller outbreak from 2012-2013 is also included in this dataset. For our analyses, we want to remove entries from this earlier outbreak.
#### Be aware how filters handle missing numeric and date values
Can we just filter by onset_date to rows after June 2013? Caution! Applying filter(date_onset > as.Date("2013-06-01"))) would accidentally remove any rows in the later epidemic with a missing date of onset!
DANGER: Filtering to greater than (>) or less than (<) a date or number can remove any rows with missing values (NA)! This is because NA is treated as infinitely large and small.
What other criteria can we filter on to remove the first outbreak from the dataset? We also happen to know that this first epidemic occurred at Hospital A, Hospital B, and that there were also 10 cases at Connaught Hospital. Hospitals A & B did not have cases in the second epidemic, but Connaught Hospital had many. This is a complex filter to apply - it is wise to cross-tabulate these columns to know exactly how many rows we expect should be removed.
Let’s examine a cross-tabulation to make sure we exclude only the correct rows:
table(Hospital = linelist$hospital, # hospital name
YearOnset = lubridate::year(linelist$date_onset), # year of the date_onset
useNA = "always") # show missing values
## YearOnset
## Hospital 2012 2013 2014 2015 <NA>
## Connaught Hospital 8 1 1377 339 80
## Hospital A 34 18 0 0 2
## Hospital B 39 12 0 0 3
## Military Hospital 0 0 676 199 39
## Missing 0 0 1129 306 62
## Other 0 0 687 176 42
## Princess Christian Maternity Hospital (PCMH) 0 0 327 90 15
## Rokupa Hospital 0 0 347 97 19
## <NA> 0 0 0 0 0We want to exclude only the nrow(linelist %>% filter(hospital %in% c("Hospital A", "Hospital B") | date_onset < as.Date("2013-06-01"))) rows from 2012 and 2013 at those three hospitals (A, B, and Connaught), including the 2 from Hospitals A & B with missing onset dates, but not any others with missing onset dates. We start with a linelist of nrow(linelist). Here is our filter statement:
linelist <- linelist %>%
filter(date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))
nrow(linelist)
## [1] 6007When we re-make the cross-tabulation, we see that Hospitals A & B are removed completely, the 10 Connaught Hospital cases from 2012 & 2013 are removed, and all other values are the same - just as we wanted.
table(Hospital = linelist$hospital, # hospital name
YearOnset = lubridate::year(linelist$date_onset), # year of the date_onset
useNA = "always") # show missing values
## YearOnset
## Hospital 2014 2015 <NA>
## Connaught Hospital 1377 339 80
## Military Hospital 676 199 39
## Missing 1129 306 62
## Other 687 176 42
## Princess Christian Maternity Hospital (PCMH) 327 90 15
## Rokupa Hospital 347 97 19
## <NA> 0 0 0Multiple statements can be included within one filter command (separated by commas), or you can always pipe to a separate filter() command for clarity. Adding these filters to the cleaning pipe chain now looks like this:
Filtering can also be done as a stand-alone command (not part of a pipe chain). Like other dplyr verbs, in this case the first argument must be the dataset itself.
# dataframe <- filter(dataframe, condition(s) for rows to keep)
linelist <- filter(linelist, !is.na(case_id))You can also use base R to subset using square brackets which reflect the [rows, columns] that you want to retain.
# dataframe <- dataframe[row conditions, column conditions] (blank means keep all)
linelist <- linelist[!is.na(case_id), ]TIP: Use bracket-subset syntax with View() to quickly review a few records.
This base R syntax can be handy when you want to quickly view a subset of rows and columns. Use the base R View() command (note the capital “V”) around the [] subset you want to see. The result will appear as a dataframe in your RStudio viewer panel. For example, if I want to review onset and hospitalization dates of 3 specific cases:
View(linelist[linelist$case_id %in% c("11f8ea", "76b97a", "47a5f5"), c("date_onset", "date_hospitalisation")])Note: the above command can also be written with dplyr verbs filter() and select() as below:
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
# remove 2nd row, which contains data dictionary values
#######################################################
# store column names
linelist_raw_names <- import(here::here("data", "ebola_simulated.xlsx")) %>% names()
# import raw dataset, skipping 2nd row (data dictionary row) and re-attaching column names
linelist_raw <- import(here::here("data", "ebola_simulated.xlsx"), skip = 2, col_names = linelist_raw_names)
# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
# standardize column name syntax
janitor::clean_names() %>%
# manually re-name columns
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome) %>%
# remove column
select(-c(row_num, merged_column, x23)) %>%
# fix incorrect values # old value # new value
mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>%
# correct the class of the columns
mutate(across(contains("date"), as.Date),
generation = as.numeric(generation),
age = as.numeric(age)) %>%
# clean values of hospital column
mutate(hospital = recode(hospital,
# OLD = NEW
"Mitylira Hopital" = "Military Hospital",
"Mitylira Hospital" = "Military Hospital",
"Military Hopital" = "Military Hospital",
"Connaught Hopital" = "Connaught Hospital",
"Rokupa Hopital" = "Rokupa Hospital",
"other" = "Other",
"Princess Christian Maternity Hopital (PCMH)" = "Princess Christian Maternity Hospital (PCMH)"
)) %>%
mutate(hospital = replace_na(hospital, "Missing")) %>%
# create age_years column (from age and age_unit)
mutate(age_years = case_when(
age_unit == "years" ~ age,
age_unit == "months" ~ age/12,
is.na(age_unit) ~ age,
TRUE ~ NA_real_)) %>%
# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
###################################################
filter(
# keep only rows where case_id is not missing
!is.na(case_id),
# also filter to keep only the second outbreak
date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))CAUTION: This tab may follow from previous tabs.
Special approaches for creating numeric categories
Common examples include age categories, groups of lab values, etc.
There are several ways to create categories of a numeric column such as age. Here we will discuss:
cut(), from base Rage_categories(), from the epikit packageSometimes, numeric variables will import as class “character”. This occurs if there are non-numeric characters in some of the values, for example an entry of “2 months” for age, or (depending on your R locale settings) if a comma is used in the decimals place (e.g. “4,5” to mean four and one half years).
For this example we will create an age_cat column using the age_years column.
cut()You can use the base function cut(), which creates categories from a numeric variable.
The basic syntax within cut() is to first provide the numeric variable to be cut (age_years), and then the breaks argument, which is a numeric vector (c()) of break points. Using cut(), the resulting column is automatically a ordered factor.
If used within mutate() (a dplyr verb) it is not necessary to specify the dataset before the column name (e.g. linelist$age_years).
Create new column of age categories (age_cat) by cutting the numeric age_year column at specified break points:
c(0, 5, 10, 15, ...)cut() is that lower break values are excluded from each category, and upper break values are includedinclude.lowest = TRUElabels = argumentlinelist <- linelist %>%
mutate(age_cat = cut(age_years, # numeric column
breaks = c(0, 5, 10, 15, 20,
30, 50, 70, 100), # break points for categories
# default: lower breaks excluded and upper included
include.lowest = TRUE, # include 0 in lowest category
labels = c("0-5", "6-10", "11-15", "16-20", # manual labels - be careful!
"21-30", "31-50", "51-70", "71-100")))
table(linelist$age_cat, useNA = "always")
##
## 0-5 6-10 11-15 16-20 21-30 31-50 51-70 71-100 <NA>
## 1353 1089 985 814 1048 600 27 1 90Below is a detailed description of the behavior of using cut() to make the age_cat column:
The most basic, simple version of cut() applied to age_years to make the new variable age_cat:
# Create new variable, by cutting the numeric age variable
# by default, upper break is excluded and lower break excluded from each category
linelist <- linelist %>%
mutate(age_cat = cut(age_years, breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100)))
# tabulate the number of observations per group
table(linelist$age_cat, useNA = "always")
##
## (0,5] (5,10] (10,15] (15,20] (20,30] (30,50] (50,70] (70,100]
## 1230 1089 985 814 1048 600 27 1
## <NA>
## 213By default, the grouping occurs so that the right/upper side is “open” and inclusive (and the left/lower side is “closed” or exclusive). The default labels use the notation “(A, B]”, which means the group does not include A (the lower break value), but includes B (the upper break value). You can reverse this behavior by providing the right argument and setting equal to FALSE (explanation below).
Check your work!!! Verify that each age value was assigned to the correct category. Note how above, 213 rows are categorized as NA. Should all of these be assigned NA? Check by cross-tabulating the age_years and age_cat columns (names added for clarity).
You can also see in the cross-tabulation below that “0” values were assigned as missing (NA)! This is because the command was by default “closed” on the lower side of each category. These “0” values could be infants coded as age 0. Read more how to address this situation below.
Also check category boundary values are assigned to. For example the 213 15-year-olds. Currently they are in the “(10,15]” category which includes ages 11-15, and they are not in the "(15,20] category which includes ages 16-20.
# Cross tabulation of the numeric and category columns. Names specified for clarity.
table("Numeric Values" = linelist$age_years,
"Categories" = linelist$age_cat,
useNA = "always")
## Categories
## Numeric Values (0,5] (5,10] (10,15] (15,20] (20,30] (30,50] (50,70]
## 0 0 0 0 0 0 0 0
## 0.166666666666667 2 0 0 0 0 0 0
## 0.25 2 0 0 0 0 0 0
## 0.333333333333333 5 0 0 0 0 0 0
## 0.416666666666667 2 0 0 0 0 0 0
## 0.5 2 0 0 0 0 0 0
## 0.583333333333333 1 0 0 0 0 0 0
## 0.75 1 0 0 0 0 0 0
## 0.833333333333333 3 0 0 0 0 0 0
## 0.916666666666667 3 0 0 0 0 0 0
## 1 254 0 0 0 0 0 0
## 1.5 1 0 0 0 0 0 0
## 2 252 0 0 0 0 0 0
## 3 234 0 0 0 0 0 0
## 4 225 0 0 0 0 0 0
## 5 243 0 0 0 0 0 0
## 6 0 213 0 0 0 0 0
## 7 0 240 0 0 0 0 0
## 8 0 210 0 0 0 0 0
## 9 0 208 0 0 0 0 0
## 10 0 218 0 0 0 0 0
## 11 0 0 203 0 0 0 0
## 12 0 0 192 0 0 0 0
## 13 0 0 189 0 0 0 0
## 14 0 0 188 0 0 0 0
## 15 0 0 213 0 0 0 0
## 16 0 0 0 161 0 0 0
## 17 0 0 0 185 0 0 0
## 18 0 0 0 171 0 0 0
## 19 0 0 0 149 0 0 0
## 20 0 0 0 148 0 0 0
## 21 0 0 0 0 143 0 0
## 22 0 0 0 0 132 0 0
## 23 0 0 0 0 122 0 0
## 24 0 0 0 0 124 0 0
## 25 0 0 0 0 117 0 0
## 26 0 0 0 0 90 0 0
## 27 0 0 0 0 100 0 0
## 28 0 0 0 0 83 0 0
## 29 0 0 0 0 66 0 0
## 30 0 0 0 0 71 0 0
## 31 0 0 0 0 0 78 0
## 32 0 0 0 0 0 49 0
## 33 0 0 0 0 0 59 0
## 34 0 0 0 0 0 54 0
## 35 0 0 0 0 0 52 0
## 36 0 0 0 0 0 39 0
## 37 0 0 0 0 0 39 0
## 38 0 0 0 0 0 32 0
## 39 0 0 0 0 0 31 0
## 40 0 0 0 0 0 28 0
## 41 0 0 0 0 0 24 0
## 42 0 0 0 0 0 19 0
## 43 0 0 0 0 0 24 0
## 44 0 0 0 0 0 17 0
## 45 0 0 0 0 0 11 0
## 46 0 0 0 0 0 9 0
## 47 0 0 0 0 0 8 0
## 48 0 0 0 0 0 15 0
## 49 0 0 0 0 0 9 0
## 50 0 0 0 0 0 3 0
## 51 0 0 0 0 0 0 3
## 52 0 0 0 0 0 0 6
## 53 0 0 0 0 0 0 2
## 54 0 0 0 0 0 0 4
## 55 0 0 0 0 0 0 2
## 57 0 0 0 0 0 0 2
## 58 0 0 0 0 0 0 1
## 59 0 0 0 0 0 0 2
## 60 0 0 0 0 0 0 1
## 61 0 0 0 0 0 0 2
## 62 0 0 0 0 0 0 1
## 64 0 0 0 0 0 0 1
## 72 0 0 0 0 0 0 0
## <NA> 0 0 0 0 0 0 0
## Categories
## Numeric Values (70,100] <NA>
## 0 0 123
## 0.166666666666667 0 0
## 0.25 0 0
## 0.333333333333333 0 0
## 0.416666666666667 0 0
## 0.5 0 0
## 0.583333333333333 0 0
## 0.75 0 0
## 0.833333333333333 0 0
## 0.916666666666667 0 0
## 1 0 0
## 1.5 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## 7 0 0
## 8 0 0
## 9 0 0
## 10 0 0
## 11 0 0
## 12 0 0
## 13 0 0
## 14 0 0
## 15 0 0
## 16 0 0
## 17 0 0
## 18 0 0
## 19 0 0
## 20 0 0
## 21 0 0
## 22 0 0
## 23 0 0
## 24 0 0
## 25 0 0
## 26 0 0
## 27 0 0
## 28 0 0
## 29 0 0
## 30 0 0
## 31 0 0
## 32 0 0
## 33 0 0
## 34 0 0
## 35 0 0
## 36 0 0
## 37 0 0
## 38 0 0
## 39 0 0
## 40 0 0
## 41 0 0
## 42 0 0
## 43 0 0
## 44 0 0
## 45 0 0
## 46 0 0
## 47 0 0
## 48 0 0
## 49 0 0
## 50 0 0
## 51 0 0
## 52 0 0
## 53 0 0
## 54 0 0
## 55 0 0
## 57 0 0
## 58 0 0
## 59 0 0
## 60 0 0
## 61 0 0
## 62 0 0
## 64 0 0
## 72 1 0
## <NA> 0 90One solution to categorize the “0” values in the lowest category is to add the argument include.lowest = TRUE. Then, any “0” values are still included in the lowest group. Note how the total for the lowest group changed, and that the automatically-generated label for the lowest category changes from “(0,5]” to “[0,5]”, which signifies that 0 values are included. Also note there are still 90 rows missing an age_years value, as expected.
linelist <- linelist %>%
mutate(age_cat = cut(age_years,
breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100), # same breaks
include.lowest = TRUE)) # Includes lowest value (0) in lowest group
table(linelist$age_cat, useNA = "always")
##
## [0,5] (5,10] (10,15] (15,20] (20,30] (30,50] (50,70] (70,100]
## 1353 1089 985 814 1048 600 27 1
## <NA>
## 90Customized category labels** can be added manually with the labels argument. Be careful in your manual labeling! Pay attention to your inclusion/exclusion criteria!
linelist <- linelist %>%
mutate(age_cat = cut(age_years,
breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100), # same breaks
include.lowest = TRUE, # Includes lowest value (0) in lowest group
labels = c("0-5", "6-10", "11-15", "16-20", # manual labels - be careful!
"21-30", "31-50", "51-70", "71-100")))
table(linelist$age_cat, useNA = "always")
##
## 0-5 6-10 11-15 16-20 21-30 31-50 51-70 71-100 <NA>
## 1353 1089 985 814 1048 600 27 1 90Lower break values will be included in each category (and upper break values excluded) if the argument right = is included and and set to TRUE. This is applied below - note how the values have shifted among the categories.
NOTE: If you include the include.lowest = TRUE argument and right = TRUE, the include.lowest will now apply to the highest break point value and category, not the lowest.
linelist <- linelist %>%
mutate(age_cat = cut(age_years,
breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100), # same breaks
right = FALSE, # include each *lower* break point
labels = c("0-4", "5-9", "10-14", "15-19",
"20-29", "30-49", "50-69", "70-100"))) # now the labels must change
table(linelist$age_cat, useNA = "always")
##
## 0-4 5-9 10-14 15-19 20-29 30-49 50-69 70-100 <NA>
## 1110 1114 990 879 1125 668 30 1 90NA values with cut()Because cut() does not automatically label NA values, you may want to assign a label such as “Missing”. This requires a few extra steps because cut() automatically classified age_cat as a Factor (a rigid column class with specific value labels).
First, convert age_cut from Factor to Character class, so you have flexibility to add new character values (“Missing”). Otherwise you will encounter an error. Then use the dplyr verb replace_na() to replace NA values with a character value like “Missing”. These can be combined into one step, as below.
Note that Missing has been added, but the order of the categories is now wrong (alphabetical).
linelist <- linelist %>%
# cut() creates age_cat, automatically of class Factor
mutate(age_cat = cut(age_years,
breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),
right = FALSE,
labels = c("0-4", "5-9", "10-14", "15-19",
"20-29", "30-49", "50-69", "70-100")),
# convert to class Character, and replace NA with "Missing"
age_cat = replace_na(as.character(age_cat), "Missing"))
table(linelist$age_cat, useNA = "always")
##
## 0-4 10-14 15-19 20-29 30-49 5-9 50-69 70-100 Missing <NA>
## 1110 990 879 1125 668 1114 30 1 90 0To fix this, re-convert age_cat to a factor, and define the order of the levels correctly.
linelist <- linelist %>%
# cut() creates age_cat, automatically of class Factor
mutate(age_cat = cut(age_years,
breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),
right = FALSE,
labels = c("0-4", "5-9", "10-14", "15-19",
"20-29", "30-49", "50-69", "70-100")),
# convert to class Character, and replace NA with "Missing"
age_cat = replace_na(as.character(age_cat), "Missing"),
# re-classify age_cat as Factor, with correct level order and new "Missing" level
age_cat = factor(age_cat, levels = c("0-4", "5-9", "10-14", "15-19", "20-29",
"30-49", "50-69", "70-100", "Missing")))
table(linelist$age_cat, useNA = "always")
##
## 0-4 5-9 10-14 15-19 20-29 30-49 50-69 70-100 Missing <NA>
## 1110 1114 990 879 1125 668 30 1 90 0DANGER: If you provide a highest break value that is too low, values may be excluded accidentally! You can write code that automatically adapts, by replacing a static highest number with the max() function.
Don’t forget to include the na.rm = TRUE argument to max().
linelist <- linelist %>%
mutate(age_cat = cut(age_years,
breaks = c(0, 5, 10, 15, 20, 30, 50, 70, max(age_years, na.rm=TRUE)),
include.lowest = TRUE))
table(linelist$age_cat, useNA = "always")
##
## [0,5] (5,10] (10,15] (15,20] (20,30] (30,50] (50,70] (70,72] <NA>
## 1353 1089 985 814 1048 600 27 1 90If you want a fast way to make breaks and labels, you can use something like below (adjust to your specific situation). See the page on using seq() and rep() and c() TO DO
# Make break points from 0 to 90 by 5
age_seq = seq(from = 0, to = 90, by = 5)
age_seq
# Make labels for the above categories, assuming default cut() settings
age_labels = paste0(age_seq+1, "-", age_seq + 5)
age_labels
# check that both vectors are the same length
length(age_seq) == length(age_labels)
# # Use them in the cut() command
# cut(linelist$age, breaks = age_seq, labels = age_labels)case_when()The dplyr function case_when() can also be used to create numeric categories.
NA values in one stepIf using case_when() please review the in-depth page on it, as the logic and order of assignment are important understand to avoid errors.
CAUTION: In case_when() all right-hand side values must be of the same class. Thus, if your categories are character values (e.g. “20-30 years”) then any designated outcome for NA age values must also be character (“Missing”, or the special NA_character_ instead of NA).
You will need to designate the column as a factor (by wrapping case_when() in the function factor()) and provide the ordering of the factor levels using the levels = argument after the close of the case_when() function. When using cut(), the factor and ordering of levels is done automatically.
linelist <- linelist %>%
mutate(age_cat = factor(case_when(
# provide the case_when logic and outcomes
age_years >= 0 & age_years < 5 ~ "0-4", # logic by age_year value
age_years >= 5 & age_years < 10 ~ "5-9",
age_years >= 10 & age_years < 15 ~ "10-14",
age_years >= 15 & age_years < 20 ~ "15-19",
age_years >= 20 & age_years < 30 ~ "20-29",
age_years >= 30 & age_years < 50 ~ "30-49",
age_years >= 50 & age_years < 70 ~ "50-69",
age_years >= 45 & age_years <= 100 ~ "70-100",
is.na(age_years) ~ "Missing", # if age_years is missing
TRUE ~ "Check value" # catch-all alarm to trigger review
), levels = c("0-4","5-9", "10-14", "15-19", "20-29", "30-49", "50-69", "70-100", "Missing", "Check value"))
)
table(linelist$age_cat, useNA = "always")
##
## 0-4 5-9 10-14 15-19 20-29 30-49
## 1110 1114 990 879 1125 668
## 50-69 70-100 Missing Check value <NA>
## 30 1 90 0 0Below, code to create two categorical age columns is added to the cleaning pipe chain:
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
# remove 2nd row, which contains data dictionary values
#######################################################
# store column names
linelist_raw_names <- import(here::here("data", "ebola_simulated.xlsx")) %>% names()
# import raw dataset, skipping 2nd row (data dictionary row) and re-attaching column names
linelist_raw <- import(here::here("data", "ebola_simulated.xlsx"), skip = 2, col_names = linelist_raw_names)
# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
# standardize column name syntax
janitor::clean_names() %>%
# manually re-name columns
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome) %>%
# remove column
select(-c(row_num, merged_column, x23)) %>%
# fix incorrect values # old value # new value
mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>%
# correct the class of the columns
mutate(across(contains("date"), as.Date),
generation = as.numeric(generation),
age = as.numeric(age)) %>%
# clean values of hospital column
mutate(hospital = recode(hospital,
# OLD = NEW
"Mitylira Hopital" = "Military Hospital",
"Mitylira Hospital" = "Military Hospital",
"Military Hopital" = "Military Hospital",
"Connaught Hopital" = "Connaught Hospital",
"Rokupa Hopital" = "Rokupa Hospital",
"other" = "Other",
"Princess Christian Maternity Hopital (PCMH)" = "Princess Christian Maternity Hospital (PCMH)"
)) %>%
mutate(hospital = replace_na(hospital, "Missing")) %>%
# create age_years column (from age and age_unit)
mutate(age_years = case_when(
age_unit == "years" ~ age,
age_unit == "months" ~ age/12,
is.na(age_unit) ~ age,
TRUE ~ NA_real_)) %>%
filter(
# keep only rows where case_id is not missing
!is.na(case_id),
# also filter to keep only the second outbreak
date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B"))) %>%
# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
###################################################
mutate(
# age categories column
age_cat = cut(age_years,
breaks = c(0, 5, 10, 15, 20,
30, 50, 70, max(age_years, na.rm = T)), # breaks
include.lowest = TRUE, # include 0s in lowest category
labels = c("0-5", "6-10", "11-15", "16-20", # manual labels - be careful!
"21-30", "31-50", "51-70", "71+")),
# column with 5-year age bins
age_cat5 = cut(age_years,
breaks = seq(0, 100, 5),
include.lowest = T))
CAUTION: This tab may follow from previous tabs.
Within a group, indicate/convert to the highest value in the group
Santa Clara County example - COVID contact tracing data - classification of multiple phone call records from same person into the highest category. (classify all as the highest of the group)
https://cran.r-project.org/web/packages/dplyr/vignettes/rowwise.html
CAUTION: This tab may follow from previous tabs.
across dplyr
A transformation can be applied to multiple variables at once using the across() function from the package dplyr (contained within tidyverse package).
across() can be used with any dplyr verb, but commonly with as mutate(), filter(), or summarise(). Here are some examples to get started.
across() with mutate():Change all columns to character class
#to change all columns to character class
linelist <- linelist %>%
mutate(across(everything(), as.character))Change only numeric columns ```
Here are a few online resources on using across(): Hadley Wickham’s thoughts/rationale